Inspect data

Wordclouds

Before pre-processing:

Pre-processed Data:

TF-IDF

The tf-idf statistic (term frequency–inverse document frequency) measures how important a word is to a document in a collection (or corpus) of documents.

The inverse document frequency of a term is defined as the log-scaled ratio of the total number of documents to the number of documents containing the term:

\[ idf\text{(term)}=\ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \]
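The definition above can be illustrated with a toy base-R sketch (the three hand-tokenized "documents" below are invented for illustration, not taken from the project corpus):

```r
# Toy corpus: three hand-tokenized "documents" (illustrative only)
docs <- list(
  c("economy", "tax", "reform"),
  c("climate", "tax", "policy"),
  c("economy", "economy", "growth")
)

term <- "economy"
n_docs <- length(docs)
n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))

# idf as defined above: log of (total docs / docs containing the term)
idf <- log(n_docs / n_containing)

# tf in document 3: relative frequency of the term within that document
tf <- sum(docs[[3]] == term) / length(docs[[3]])

tf_idf <- tf * idf
round(idf, 4)     # log(3/2) ~ 0.4055
round(tf_idf, 4)
```

A term occurring in every document gets idf = log(1) = 0, so tf-idf down-weights words that are common across the whole corpus.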

Build Corpus

Structural Topic Model

Parties want the media agenda to be congruent with their own agenda in order to define the issue-based criteria on which voters will evaluate them (Eberl, 2017). Thus, parties instrumentalize their press releases to highlight issues they are perceived as competent on, that they “own”, and that are important to their voters (Kepplinger & Maurer, 2004). Editors select from this universe and decide which of these topics will be discussed in the news. In that sense, the ideological content of a newspaper refers to the extent to which the topics promoted by the parties correlate with the topics discussed in the news articles.

To discover the latent topics in the corpus of press releases (1,942) and news articles (11,880), a structural topic model (STM) as developed by Roberts et al. (2016) is applied. The STM is an unsupervised machine learning approach that models topics as multinomial distributions of words and documents as multinomial distributions of topics, while allowing external covariates to be incorporated that affect both topical content and topical prevalence.

Select Model

STM assumes a fixed, user-specified number of topics. There is no single “right” number of topics for a given corpus (Grimmer and Stewart 2013). Roberts et al. (2016) propose to measure topic quality through a combination of semantic coherence and the exclusivity of words to topics. Semantic coherence is a criterion developed by Mimno et al. (2011) and is closely related to pointwise mutual information (Newman et al. 2010): it is maximized when the most probable words in a given topic frequently co-occur.

Using the function searchK, several automated tests are performed to help choose the number of topics, including the average exclusivity and semantic coherence as well as the held-out likelihood (Wallach et al. 2009) and the residuals (Taddy 2012).
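With the stm package these diagnostics can be computed in one call; a sketch, assuming `docs`, `vocab`, and `meta` are the outputs of `stm::prepDocuments` for this corpus (the candidate K values are illustrative):

```r
library(stm)

# Compare candidate numbers of topics on held-out likelihood, residuals,
# semantic coherence, and exclusivity (assumes docs/vocab/meta exist)
k_search <- searchK(docs, vocab,
                    K = c(20, 30, 40, 50, 60),  # illustrative candidates
                    prevalence = ~ source, data = meta)

plot(k_search)  # one panel per diagnostic, plotted against K
```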

Run Model

I included the document source as a control for topical prevalence, assuming that the distribution of topics depends on the source. The number of topics is set to 50.

Results

library(stm)
library(tidyverse)
library(ggthemes)
library(xtable)
library(viridis)

rm(list = ls())
load("../output/models/finalmodel_50.RDa")

model_df <- model_df %>%
  dplyr::mutate(
    doc_index = as.numeric(rownames(.)),
    # map raw source keys to display names
    source = dplyr::case_when(
      source == "welt.de"    ~ "DIE WELT",
      source == "zeit.de"    ~ "ZEIT ONLINE",
      source == "focus.de"   ~ "FOCUS Online",
      source == "bild.de"    ~ "Bild.de",
      source == "spiegel.de" ~ "SPIEGEL ONLINE",
      source == "union"      ~ "Union",
      source == "spd"        ~ "SPD",
      source == "afd"        ~ "AfD",
      source == "gruene"     ~ "Grüne",
      source == "linke"      ~ "Linke",
      source == "fdp"        ~ "FDP",
      TRUE                   ~ source
    )
  )

Label topics

To explore the words associated with each topic, we use the words with the highest probability in each topic. As we included the source type (press release or newspaper) as a control for the topical content (the word distribution of each topic), we have two different labels for each topic.

sagelabs <- sageLabels(stmOut)
theta <- as.data.frame(stmOut$theta) %>% # get all theta values for each document
  
  mutate(doc_index = as.numeric(rownames(.))) %>%
  # convert to long format
  gather(topic, theta, -doc_index) %>%
  mutate(topic = as.numeric(gsub("V","",topic))) %>%
  
  # join with topic df
  left_join(., topics.df, by="topic") %>%
  
  # join with model_df
  left_join(., model_df %>% 
              select(date,type,source,doc_index,title_text), by="doc_index")

Topic frequency

The expected proportion of the corpus that belongs to each topic is used to get an initial overview of the results. The figure below displays the topics ordered by their expected frequency across the corpus. The four most frequent words in each topic are used as a label for that topic.

overall_freq <- as.data.frame(colMeans(stmOut$theta)) %>%
  transmute(
    topic = as.numeric(rownames(.)),
    frequency = colMeans(stmOut$theta)
         ) %>%
  left_join(., topics.df, by = "topic") %>% 
  arrange(desc(frequency))%>%
  mutate(order = row_number())
overall_freq %>%
  ggplot(aes(reorder(joint_label, -order), 
             frequency, fill=frequency)) +
  geom_col(show.legend = F) +
  coord_flip() +
  scale_fill_gradient(low = "#dee24e", high = "#421c64") +
  theme_hc() +
  labs(x=NULL, y=NULL) 

ggsave("../figs/topic_proportion.png", height = 10, width = 10)

Measure Agendas

Agendas were measured as percentage distributions across the 50 topics. For each source, the average prevalence of each topic is calculated per month. The following figures show the overall topic distribution.

topicmean_news <- theta %>%
  filter(type == "news") %>%
  group_by(topic,joint_label, source) %>%
  summarise(topicmean = mean(theta)) %>% 
  ungroup()

topicmean_press <- theta %>%
  filter(type == "press") %>%
  group_by(topic,joint_label, source) %>%
  summarise(topicmean = mean(theta)) %>% 
  ungroup()
topicmean_news %>%
  group_by(source) %>%
  arrange(desc(topicmean), .by_group = TRUE) %>%
  mutate(topic_order = row_number()) %>%
  ungroup() %>%
  
  group_by(joint_label) %>%
  mutate(topicmean_mean = mean(topicmean)) %>%
  ungroup() %>%
  top_n(70, topicmean_mean) %>%
  
  ggplot(aes(reorder(joint_label, topicmean_mean),
             topicmean, label = topic_order,
             fill = topic_order)) +
  geom_col(show.legend = F) +
  geom_text(hjust=-0.1, size=5) +
  coord_flip() +
  scale_fill_gradient(low = "#00AFBB", high = "#FC4E07") +
  facet_wrap(~source, nrow = 1) +
  labs(x=NULL, y=NULL) +
  theme(axis.text.y = element_text(size=12))

topicmean_press %>%
  group_by(source) %>%
  arrange(desc(topicmean), .by_group = TRUE) %>%
  mutate(topic_order = row_number()) %>%
  ungroup() %>%
  
  group_by(joint_label) %>%
  mutate(topicmean_mean = mean(topicmean)) %>%
  ungroup() %>%
  top_n(50, topicmean_mean) %>%
  
  ggplot(aes(reorder(joint_label, topicmean_mean),
             topicmean, label = topic_order,
             fill=topic_order)) +
  geom_col(show.legend = F) +
  geom_text(hjust=-0.1, size=5) +
  coord_flip() +
  scale_fill_gradient(low = "#dee24e", high = "#421c64") +
  facet_wrap(~source, nrow = 1) +
  labs(x=NULL, y=NULL) +
  theme(axis.text.y = element_text(size=12))

Correlation of topic prevalence

normalize_data <- function(x) {
  # rescale numeric input linearly to the interval [-1, 1];
  # non-numeric input is returned unchanged
  if (is.numeric(x)) {
    2 * ((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))) - 1
  } else {
    x
  }
}
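A quick check of the rescaling (the function is repeated here so the snippet runs standalone):

```r
# Same rescaling as normalize_data above, repeated so the example is self-contained
normalize_data <- function(x) {
  if (is.numeric(x)) {
    2 * ((x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))) - 1
  } else {
    x
  }
}

normalize_data(c(0, 5, 10))   # -1  0  1
normalize_data(c("a", "b"))   # non-numeric input is returned unchanged
```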

For each document \(d\), we get a \(k \times 1\) vector, the document-topic distribution \(\theta_{d}\):

\[ \theta_{d} = \begin{bmatrix} \theta_{1} \\ \vdots \\ \theta_{k-1} \\ \theta_{k} \end{bmatrix} \]

where \(\theta_1, \dots, \theta_{k-1}, \theta_k\) are the probabilities of observing each of the \(k\) topics in document \(d\).
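Because \(\theta_d\) is a probability distribution over topics, its entries sum to one; a toy base-R illustration with hypothetical values (not taken from the fitted model):

```r
# Hypothetical theta_d for a model with k = 5 topics; values are illustrative
theta_d <- c(0.50, 0.22, 0.10, 0.10, 0.08)

# Each theta_d is a probability distribution over topics, so it sums to 1
sum(theta_d)        # 1
which.max(theta_d)  # index of the dominant topic: 1
```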

For example, the following vector is obtained for a randomly sampled document:

sample_doc <- sample(nrow(model_df),1)

theta %>% 
  filter(doc_index == sample_doc) %>%
  transmute(theta = round(theta, 4)) %>% htmlTable::htmlTable()
(Output: a table of the 50 rounded topic proportions for the sampled document; in this draw the distribution is dominated by topic 32 (0.498), topic 17 (0.220), and topic 14 (0.102).)
# uncomment this to only select docs from press releases
#sample_doc <- theta %>% filter(type=="press") %>% sample_n(1) %>% select(doc_index)
#sample_doc <- sample_doc$doc_index

title <- model_df$title[which(model_df$doc_index == sample_doc)]
source <- model_df$source[which(model_df$doc_index == sample_doc)]

theta %>%
  filter(doc_index == sample_doc) %>%
  ggplot(aes(reorder(joint_label, desc(topic)), theta)) +
  geom_col(fill="#3d648a", alpha = 0.8) +
  ylim(c(0,1)) +
  coord_flip() +
  theme_hc() +
  labs(title = paste("Topic distribution of document",sample_doc),
       subtitle = paste0("Source: ",source,"\nTitle: ", title),
       x = NULL, y = NULL
       ) +
  theme(axis.text = element_text(size = 10))

#ggsave("../figs/doc_topic_distr.png", height = 10, width = 10)

For each source \(s\), we get a \(k \times d\) matrix \(\Theta_s\) collecting the document-topic distributions of all its documents as column vectors:

\[ \Theta_s = \begin{bmatrix} \theta_{11} & \dots & \theta_{1d} \\ \vdots & \ddots & \vdots \\ \theta_{k1} & \dots & \theta_{kd} \end{bmatrix} \]

where \(\theta_{j}\), the \(j\)-th column of \(\Theta_s\) for \(j \in \{1, \dots, d\}\), is a \(k \times 1\) vector giving the \(j\)-th document’s probabilities over the \(k\) topics.

For example, for “DIE WELT” the following (truncated) matrix of document-topic distributions is obtained:

theta %>% 
  filter(source == "DIE WELT") %>%
  filter(doc_index %in% seq(1,50)) %>%
  select(doc_index, topic, theta) %>%
  mutate(theta = round(theta, 3)) %>%
  spread(doc_index,theta) %>% select(-topic) %>% 
  htmlTable::htmlTable()
(Output: a 50 × 11 excerpt of the document-topic matrix for DIE WELT, with one column per document (doc_index 1–42); most entries are close to zero while each document concentrates on a few topics, e.g. document 38 loads 0.985 on topic 22.)

Grouping by source and topic, the mean prevalence of topic \(i\) is given by

\[ \bar{\theta}_{i} = \frac{1}{d}\sum^{d}_{j=1}\theta_{ij} \]

where \(i \in \{1, \dots, k\}\).
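In base R this mean is simply a row mean over a source's \(\Theta_s\) matrix; a toy sketch with hypothetical values:

```r
# Hypothetical Theta_s: k = 3 topics (rows) x d = 4 documents (columns);
# each column is a document-topic distribution and sums to 1
theta_s <- matrix(c(0.6, 0.3, 0.1,
                    0.2, 0.5, 0.3,
                    0.7, 0.2, 0.1,
                    0.1, 0.1, 0.8),
                  nrow = 3)

# Mean prevalence of each topic across the source's documents
topic_means <- rowMeans(theta_s)
round(topic_means, 3)  # 0.400 0.275 0.325
```

Because every column sums to one, the topic means of a source again sum to one.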

# calculate overall topic mean by source (May excluded)
topicmean <- theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  filter(month != 5) %>%
  group_by(topic,source) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  spread(source, topicmean) 

theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  filter(month != 5) %>%
  group_by(topic,source) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  mutate(topicmean = round(topicmean,4)) %>%
  spread(source, topicmean) %>%
  select(-topic) %>%
  htmlTable::htmlTable()
(Output: a 50 × 13 table of mean topic prevalence per source, with one column per party (AfD, B90/GRÜNE, CDU, DIE LINKE, FDP, SPD) and per outlet (Bild.de, DIE WELT, FOCUS Online, SPIEGEL ONLINE, stern.de, tagesschau.de, ZEIT ONLINE).)

Next, I estimate bivariate correlations between the party agendas (press releases) and the mediated party agendas in the online news. These correlations capture the agenda selectivity each party experiences in each media outlet: the higher the correlation, the more congruent the two agendas.
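The core computation is just Pearson's \(r\) between two mean-prevalence vectors; a toy base-R sketch with hypothetical agendas over five topics:

```r
# Hypothetical mean topic prevalence over 5 topics (values illustrative)
party_agenda  <- c(0.30, 0.25, 0.20, 0.15, 0.10)  # e.g. a party's press releases
outlet_agenda <- c(0.28, 0.22, 0.25, 0.10, 0.15)  # e.g. one outlet's coverage

# Pearson correlation: higher r = more congruent agendas
r <- cor(party_agenda, outlet_agenda)
round(r, 3)  # ~0.814
```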

E.g. for “CDU” we get the following correlations:

media <- unique(model_df %>% filter(type == "news") %>% select(source))
parties <- unique(model_df %>% filter(type == "press") %>% select(source))

cor(topicmean[,media$source], topicmean[,"CDU"]) %>% htmlTable::htmlTable()
CDU
DIE WELT 0.111393975907849
stern.de 0.104683612695906
ZEIT ONLINE 0.035627367692623
FOCUS Online -0.009682026527168
Bild.de 0.00426245147510798
SPIEGEL ONLINE -0.00539349393457622
tagesschau.de 0.373370887419879
if (exists("corrDF")) rm(corrDF)  # start from a clean slate
for (i in parties$source) {
  
  tempdf <- topicmean %>%
    do(data.frame(Cor=t(cor(.[,media$source], .[,i])))) %>%
    gather(medium, cor) %>%
    mutate(party = i,
           medium = gsub("Cor.","",medium))
  
  if (exists("corrDF")){
    corrDF <- rbind(corrDF,tempdf)
  } else {
    corrDF <- tempdf
  }
  
}

agenda <- corrDF %>% 
  dplyr::mutate(medium = ifelse(medium == "DIE.WELT", "DIE WELT", medium),
                medium = ifelse(medium ==  "ZEIT.ONLINE", "ZEIT ONLINE", medium),
                medium = ifelse(medium == "FOCUS.Online", "FOCUS Online", medium),
                medium = ifelse(medium == "SPIEGEL.ONLINE", "SPIEGEL ONLINE", medium)
  )

Correlation of topic prevalence - grouped by party & medium

p <- agenda %>%
  spread(key = party, value = cor) %>%
  ggiraphExtra::ggRadar(aes(color = medium),
                        interactive = T,
                        alpha = 0,
                        rescale = F,
                        legend.position = "bottom") 

htmlwidgets::saveWidget(p, "../figs/radarchart.html")

p
# calculate topic mean by source and month
topicmean <- theta %>%
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date)
    ) %>%
  group_by(topic,source, month, year) %>%
  dplyr::summarise(topicmean = mean(theta)) %>% 
  ungroup() %>%
  spread(source, topicmean) %>%
  filter(month != 5)
media <- unique(model_df %>% filter(type == "news") %>% select(source))
parties <- unique(model_df %>% filter(type == "press") %>% select(source))
  
if (exists("corrDF")) rm(corrDF)  # start from a clean slate
for (i in parties$source) {
  
  tempdf <- topicmean %>%
    group_by(month, year) %>%
    do(data.frame(Cor=t(cor(.[,media$source], .[,i])))) %>%
    gather(medium, cor, 3:9) %>%
    mutate(party = i,
           medium = gsub("Cor.","",medium)) %>%
    ungroup()
  
  if (exists("corrDF")){
    corrDF <- rbind(corrDF,tempdf)
  } else {
    corrDF <- tempdf
  }
  
}

agenda <- corrDF %>% 
  mutate(date = as.Date(paste0(year,"/",month,"/1"))) %>%
  dplyr::mutate(medium = ifelse(medium == "DIE.WELT", "DIE WELT", medium),
                medium = ifelse(medium ==  "ZEIT.ONLINE", "ZEIT ONLINE", medium),
                medium = ifelse(medium == "FOCUS.Online", "FOCUS Online", medium),
                medium = ifelse(medium == "SPIEGEL.ONLINE", "SPIEGEL ONLINE", medium)
  ) %>%
  filter(month > 5)

Deviation from average visibility

Owing to differing political relevance, not all potential topics receive equal amounts of media coverage. However, these factors should influence all media outlets equally. To what extent does the topic correlation of a party in one medium differ from the average topic correlation across media? To calculate this relative topic correlation, I estimate the deviation of a party's topic correlation in one medium from that party's average topic correlation over all newspapers.
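The deviation is simply each medium's correlation minus the party's cross-media mean; a toy base-R sketch with hypothetical correlations (the outlet names are placeholders):

```r
# Hypothetical topic correlations of one party with four outlets (illustrative)
cors <- c(outlet_a = 0.45, outlet_b = 0.30, outlet_c = 0.55, outlet_d = 0.30)

# Deviation of each outlet from the party's average correlation
cor_diff <- cors - mean(cors)
round(cor_diff, 3)
# outlet_a outlet_b outlet_c outlet_d
#    0.050   -0.100    0.150   -0.100
```

By construction the deviations sum to zero, so positive values flag outlets that track this party's agenda more closely than the average outlet.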

agenda_diff <- agenda %>%
  group_by(party, date) %>%
  # estimate average correlation for each party
  mutate(cor_by_party = mean(cor, na.rm = T)) %>%
  ungroup() %>%
  # estimate average correlation for each party-medium pair
  mutate(
    cor_diff = cor - cor_by_party
    )
p <- agenda_diff %>%
  ggplot(aes(date, cor_diff, color = medium)) +
  geom_line(show.legend = F) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey30", linetype = 2) +
  facet_wrap(~party) +
    scale_color_viridis_d(name = NULL) +
  labs(y=NULL, x =NULL) 

plotly::ggplotly(p, tooltip=c("cor","medium"))
p <- agenda_diff %>%
  group_by(party, medium) %>%
  summarize(cor = mean(cor_diff, na.rm = T)) %>%
  spread(key = party, value = cor) %>%
  ggiraphExtra::ggRadar(aes(color = medium),
                        interactive = T,
                        alpha = 0,
                        rescale = F,
                        legend.position = "bottom") 

p
agenda_diff %>%
  filter(!medium == "tagesschau.de") %>%
  group_by(party, medium) %>%
  summarize(cor = mean(cor_diff, na.rm = T)) %>%
  spread(key = party, value = cor) %>%
  ggiraphExtra::ggRadar(aes(color = medium),
                        interactive = T,
                        alpha = 0,
                        rescale = F,
                        legend.position = "bottom") 

Estimate correlation by date

p <- agenda %>%
  mutate(
    date =as.Date(paste("01",month,year, sep = "-"), format="%d-%m-%Y")
  ) %>%
  ggplot(aes(date, cor, color = medium)) +
  geom_line(show.legend = F) +
  geom_hline(yintercept = 0, size = 0.3, color = "grey30", linetype = 2) +
  facet_wrap(~party) +
    scale_color_viridis_d(name = NULL) +
  labs(y=NULL, x =NULL) 
  # guides(colour = guide_legend(nrow = 1)) +
  # theme(legend.position = "bottom",
  #       legend.title = element_blank())

plotly::ggplotly(p, tooltip=c("cor","medium"))